A Semi-Supervised Pattern-Learning Approach to Extract Pharmacogenomics-Specific Drug-Gene Pairs from Biomedical Literature

Authors

  • Rong Xu
  • Quanqiu Wang
Abstract

We develop a semi-supervised pattern-learning method to extract drug-gene relationships from free text. Central to our approach is the observation that the semantic relationship between a drug and a gene can be expressed in many different ways, owing to the flexibility and expressiveness of natural language. However, these patterns are not randomly distributed: there are predominant patterns that people use to describe specific types of drug-gene relationships. For example, the pattern “DRUG is metabolized by GENE” is typically used to describe a metabolism relationship between a drug and a gene. Example sentences include “Quetiapine is metabolized by CYP3A4 and sertindole by CYP2D6” (PMID 10422890) and “Cerivastatin is metabolized by CYP2C8 and CYP3A4, and fluvastatin is metabolized by CYP2C9” (PMID 17178259). On the other hand, the pattern “GENE inhibitor DRUG” is typically used to express an inhibition relationship between a drug and a gene. Example sentences include “In addition, the effect of the CYP2C9 inhibitor fluvastatin was evaluated using S-warfarin as a probe” (PMID 16758259) and “The CYP2C8 inhibitor gemfibrozil does not increase the plasma concentrations of zopiclone” (PMID 16832679). In this paper, we use two seed patterns for two types of drug-gene relationship extraction: the seed “DRUG is metabolized by GENE” for drug-gene metabolism relationships (e.g., quetiapine-CYP3A4, cerivastatin-CYP2C8) and the seed “GENE inhibitor DRUG” for drug-gene target relationships (e.g., fluvastatin-CYP2C9, gemfibrozil-CYP2C8). First, we use the seed patterns to find their associated drug-gene pairs. Then we iteratively learn new patterns associated with the extracted drug-gene pairs and use the newly discovered patterns to extract further drug-gene relationships. The iterative process stops when no additional good patterns are found.
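
The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy lexicons `DRUGS` and `GENES`, the helper names `candidates` and `bootstrap`, the `max_gap` limit, and the `min_support` threshold used to decide what counts as a "good" pattern are all assumptions made for illustration. The first two corpus sentences are quoted from the abstract; the three "substrate" sentences are invented toy examples.

```python
import re

# Toy lexicons (assumption: the abstract does not say how mentions are recognized).
DRUGS = {"quetiapine", "sertindole", "cerivastatin", "fluvastatin"}
GENES = {"CYP3A4", "CYP2D6", "CYP2C8", "CYP2C9"}

CORPUS = [
    # Quoted in the abstract.
    "Quetiapine is metabolized by CYP3A4 and sertindole by CYP2D6",
    "Cerivastatin is metabolized by CYP2C8 and CYP3A4, and fluvastatin is metabolized by CYP2C9",
    # Invented toy sentences, for illustration only.
    "Quetiapine is a substrate of CYP3A4",
    "Fluvastatin is a substrate of CYP2C9",
    "Sertindole is a substrate of CYP2D6",
]


def candidates(sentence, max_gap=5):
    """Yield (pattern, (drug, gene)) for each adjacent drug/gene mention pair."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Collect entity mentions in sentence order: (position, type, normalized name).
    mentions = []
    for i, tok in enumerate(tokens):
        if tok.lower() in DRUGS:
            mentions.append((i, "DRUG", tok.lower()))
        elif tok.upper() in GENES:
            mentions.append((i, "GENE", tok.upper()))
    # Only consecutive mentions of different types form a candidate.
    for (i, kind_a, name_a), (j, kind_b, name_b) in zip(mentions, mentions[1:]):
        if kind_a == kind_b or j - i - 1 > max_gap:
            continue
        middle = [t.lower() for t in tokens[i + 1:j]]
        pattern = " ".join([kind_a, *middle, kind_b])
        pair = (name_a, name_b) if kind_a == "DRUG" else (name_b, name_a)
        yield pattern, pair


def bootstrap(corpus, seed_pattern, min_support=2):
    """Alternate between extracting pairs and learning patterns until no new pattern passes."""
    cands = [c for sentence in corpus for c in candidates(sentence)]
    patterns, pairs = {seed_pattern}, set()
    while True:
        # Step 1: extract pairs matched by any accepted pattern.
        pairs |= {pair for pat, pair in cands if pat in patterns}
        # Step 2: count distinct known pairs supporting each unseen pattern.
        support = {}
        for pat, pair in cands:
            if pat not in patterns and pair in pairs:
                support.setdefault(pat, set()).add(pair)
        new = {pat for pat, known in support.items() if len(known) >= min_support}
        if not new:  # no additional "good" patterns: stop
            return patterns, pairs
        patterns |= new


patterns, pairs = bootstrap(CORPUS, "DRUG is metabolized by GENE")
print(sorted(patterns))
print(sorted(pairs))
```

On this toy corpus the seed first yields three pairs, those pairs support one additional pattern ("DRUG is a substrate of GENE"), and that pattern in turn yields the sertindole-CYP2D6 pair before the loop stops. Because only consecutive mentions are paired, the coordinated mention "CYP2C8 and CYP3A4" contributes only the cerivastatin-CYP2C8 pair, which is a limitation of the sketch rather than of the method described above.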

Similar resources

An Unsupervised Text Mining Method for Relation Extraction from Biomedical Literature

The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. The pattern clustering algorithm is based on the polynomial kernel method, which identifies inte...

Graph-based Semi-supervised Gene Mention Tagging

The rapidly growing biomedical literature has been a challenging target for natural language processing algorithms. One of the tasks these algorithms focus on is called named entity recognition (NER), often employed to tag gene mentions. Here we describe a new approach for this task, an approach that uses graph-based semi-supervised learning to train a Conditional Random Field (CRF) model. Bench...

A semi-supervised efficient learning approach to extract biological relationships from web-based biomedical digital library

Many biological results are published only in plain-text documents, and these documents or their abstracts are collected in web-based digital libraries such as PubMed and BioMed Central. To expedite the progress of functional bioinformatics, it is important to efficiently process large amounts of these documents, to extract these results into a structured format, and to store them in a database ...

Finding small molecule and protein pairs in scientific literature using a bootstrapping method

The relationship between small molecules and proteins has attracted attention from the biomedical research community. In this paper a text mining method of extracting small-molecule and protein pairs from natural text is presented, based on a semi-supervised machine learning approach. The technique has been applied to the complete collection of MEDLINE abstracts and pairs were extracted and eval...

Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles

Biomedical abbreviations and acronyms are widely used in biomedical literature. Since many of them represent important content in biomedical literature, information retrieval and extraction benefit from identifying the meanings of those terms. On the other hand, many abbreviations and acronyms are ambiguous, so it is important to map them to their full forms, which ultimately represent the ...

Publication date: 2013